Random Forest

Kristen Monaco, Praya Cheekapara, Raymond Fleming, Teng Ma

Random Forest Introduction

  • Ensemble machine learning method based on a large number of decision trees voting to predict a classification
  • Benefits compared to a single decision tree:
    • Able to function with incomplete data
    • Lower likelihood of overfitting
    • Improved prediction accuracy

Random Forest Applications

  • Banking
    • Fraud Detection
    • Loan Default Risk
  • Business
    • Predictive Advertising

Methods

  • Bagging
  • Boosting
  • Random Feature Selection
  • Cross Validation
  • Ensemble voting and Prediction

Bootstrap Sampling (Bagging)

  • Each decision tree uses a random sample of the original dataset
    • Sampling is performed with replacement (bootstrap sampling)
    • Using a subset of the dataset reduces the probability of an overfit model
    • Roughly one third of the rows are left out of each sample ("out-of-bag") and can be used to estimate model error
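The sampling step can be sketched in base R; the 150-row size here is an arbitrary illustration:

```r
# Draw one bootstrap sample: n rows drawn with replacement, so some
# rows repeat and the rest are left out ("out-of-bag")
bootstrap_rows <- function(n) sample(n, size = n, replace = TRUE)

set.seed(1)
idx <- bootstrap_rows(150)
length(idx)          # same size as the original data
length(unique(idx))  # fewer unique rows; the remainder are out-of-bag
```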

Boosting

  • Individual models are trained sequentially; each model learns from the mistakes of the preceding model
  • Note: standard random forests use bagging rather than boosting; boosting underlies related ensembles such as gradient-boosted trees

Random Feature Selection

  • A random set of candidate features is selected at each split during training
    • Information about feature importance may be saved and applied in future iterations
    • Even with automated random feature selection, feature selection and engineering prior to training may improve performance
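A minimal base-R sketch of the per-split feature draw (the names f1..f14 are placeholders; randomForest exposes this as the mtry argument, which defaults to the square root of the feature count for classification):

```r
p <- 14                    # total number of features (illustrative)
mtry <- floor(sqrt(p))     # features considered at each split

set.seed(1)
candidate_features <- sample(paste0("f", 1:p), mtry)
candidate_features         # the random subset eligible for this split
```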

Code
ctrl <- trainControl(method = "cv", number = 10)

bagged_cv <- train(
  Group ~ LF + GF + Biomes + Range + Habitat_degradation +
    Habitat_loss + IAS + Other + Unknown + Over_exploitation,
  data    = species_train,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE)

plot(varImp(bagged_cv), 10)

Cross Validation

  • Validates model performance on held-out data
    • Resampling method similar to bootstrapping, but performed without replacement
    • Allows approximation of the model's generalization performance
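The fold construction behind method = "cv" can be sketched in base R; the sizes are illustrative:

```r
# Partition row indices into k folds without replacement:
# each row lands in exactly one validation fold
k <- 10
n <- 100
set.seed(1)
folds <- split(sample(n), rep(1:k, length.out = n))

lengths(folds)             # each fold holds n/k rows
sum(lengths(folds)) == n   # every row is used exactly once
```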

Code
 m3 <- rpart(
   formula = Group ~ LF + GF + Biomes + Range +
     Habitat_degradation + Habitat_loss + IAS +
     Other + Unknown + Over_exploitation,
   data    = species_train,
   method  = "class"  # classification target
 )
 rpart.plot(m3)

Ensemble Voting and Prediction

  • Each trained decision tree produces its own prediction
    • Decision trees are independent, and were trained on different subsets of both data and features
  • The results from each decision tree are combined into a voting classifier
    • The mode of the classification results will be the final prediction
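A minimal sketch of the vote in base R (the class labels are illustrative):

```r
# Majority vote: the final prediction is the mode of the trees' votes
vote_mode <- function(votes) {
  tab <- table(votes)
  names(tab)[which.max(tab)]
}

tree_votes <- c("threatened", "threatened", "extinct", "threatened", "safe")
vote_mode(tree_votes)  # "threatened"
```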

Data Preparation

  • Preprocessing
    • Encode categorical features into numerical / factor features
    • Split the dataset into training and test sets, minimizing class imbalance

Preprocessing

  • Class Imbalance
    • Oversample smaller classes to approximately equalize class sizes
    • Training on imbalanced datasets will bias predictions toward the larger class
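The resampling idea (implemented later in the deck with ovun.sample) can be sketched in base R with illustrative labels:

```r
# Random oversampling: draw minority-class rows with replacement
# until the class sizes match
labels <- c(rep("A", 90), rep("B", 10))
target <- max(table(labels))
minority <- which(labels == "B")

set.seed(1)
extra <- sample(minority, target - length(minority), replace = TRUE)
balanced <- c(seq_along(labels), extra)  # row indices after oversampling
table(labels[balanced])                  # both classes now have 90 rows
```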

Normalization

  • Min-Max: \(X_{new}=\frac{X_{old}-\min(X_{old})}{\max(X_{old})-\min(X_{old})}\)
  • Z-Score: \(X_{new}=\frac{X_{old}-\bar{X}_{old}}{\sigma_{X_{old}}}\)
  • Max Absolute Value: \(X_{new}=\frac{X_{old}}{\max(|X_{old}|)}\)
  • L1 Norm: \(X_{new}=\frac{X_{old}}{\sum|X_{old}|}\)
  • L2 Norm: \(X_{new}=\frac{X_{old}}{\sqrt{\sum X_{old}^2}}\)
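Each formula above translates directly into a column-wise R function, as used later in the Analysis section:

```r
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
z_score <- function(x) (x - mean(x)) / sd(x)
max_abs <- function(x) x / max(abs(x))
l1_norm <- function(x) x / sum(abs(x))
l2_norm <- function(x) x / sqrt(sum(x^2))

x <- c(2, 4, 6, 8)
min_max(x)        # rescaled to [0, 1]
sum(l1_norm(x))   # entries sum to one
```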

Prediction

  • Combine results into a vector
    • \(Y=\{y_1,y_2,y_3,y_4,y_5\}\)
  • Identify the most frequently predicted class
    • \(y_{final}=\text{mode}(Y)\)
  • Iterate over entire test set, storing results
  • Generate a confusion matrix and calculate the sensitivity and precision for each category
  • Iterate after tuning if necessary

Evaluation

  • Four metrics are calculated using the test set
    • \(\text{Accuracy}=\frac{\sum\left(\text{Actual Label}=\text{Predicted Label}\right)}{\text{Label Count}}\)
    • \(\text{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}\)
    • \(\text{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}\)
    • \(\text{F1}=\frac{2(\text{Precision}\times\text{Recall})}{\text{Precision}+\text{Recall}}\)
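The four metrics can be computed directly from a prediction vector; a base-R sketch for a binary case with illustrative labels (multi-class recall and precision are averaged per class, as in the Analysis code):

```r
# Accuracy, recall, precision, and F1 from actual vs. predicted labels
metrics <- function(actual, predicted, pos = "1") {
  tp <- sum(predicted == pos & actual == pos)  # true positives
  fp <- sum(predicted == pos & actual != pos)  # false positives
  fn <- sum(predicted != pos & actual == pos)  # false negatives
  accuracy  <- mean(actual == predicted)
  recall    <- tp / (tp + fn)
  precision <- tp / (tp + fp)
  f1        <- 2 * precision * recall / (precision + recall)
  c(accuracy = accuracy, recall = recall, precision = precision, f1 = f1)
}

actual    <- c("1", "1", "0", "0", "1")
predicted <- c("1", "0", "0", "0", "1")
metrics(actual, predicted)
```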

Dataset and Exploration

  • South African Red List
    • Data about plants with their habitat, traits, distribution, and factors influencing their current threatened/extinct status
  • Purpose
    • Predict whether or not an unknown plant is threatened based on the above characteristics

Distribution of Range by Conservation Status

  • Although a small number of threatened species have a large range, Range appears to be a strong predictor of Group status

  • A smaller range predicts a higher likelihood of a threatened or extinct grouping

Code
ggplot(data = data, aes(x = Status, y = Range, fill = Status)) +
  geom_boxplot() +
  theme_bw() +
  ylim(0,100000)

Feature Associations

  • Cramér’s V association, with Range binned into 20 categories
    • The target feature Group is most associated with Range, Family, Habitat Loss, Biome, and GF
    • The most associated features will likely be the most important features during model training
    • Collinearity does not appear to be present, but further checks are recommended

Code
#Binning Range to make it categorical
corrDFRange <- corrDFRange %>% mutate(Range=ntile(Range, n=20))
corrplot::corrplot(DescTools::PairApply(corrDFRange,DescTools::CramerV), type='lower')

Analysis

  • The data was processed to allow it to be modeled effectively using a random forest
  • Five separate random forest models were created, each using a different method of normalization

Data Processing

  • Set the first 14 columns as the features and the last column as the label
  • Split the dataset into training and testing sets
  • Recombine the training features and labels into one training data frame
  • Print the initial count of each category
Code
features <- data[, 1:14]
label <- data[, 15]

set.seed(42)

split <- sample.split(label, SplitRatio = 0.7)
features_train = features[split,]
features_test = features[!split,]
label_train = label[split]
label_test = label[!split]

data_train <- features_train
data_train$label <- label_train

class_counts <- table(data_train$label)
cat("(Before) Data Category Counts:\n")
print(class_counts)

Handle class imbalance

  • Process classes A and B
  • Process classes A and C
  • Retain records in data_train_AB_resampled where the label is ‘2’
  • Retain records in data_train_AC_resampled where the label is ‘3’
  • Retain records in both data_train_AB_resampled and data_train_AC_resampled where the label is ‘1’
  • Combine the retained records into a balanced training set
  • Print the number of each category after class imbalance handling
Code
data_train_AB <- data_train
data_train_AB <- data_train_AB[data_train_AB$label != '3',]
data_train_AB_resampled <- ovun.sample(label ~ ., data = data_train_AB, method = "over", N = 980, seed = 1)$data

data_train_AC <- data_train
data_train_AC <- data_train_AC[data_train_AC$label != '2',]
data_train_AC_resampled <- ovun.sample(label ~ ., data = data_train_AC, method = "over", N = 980, seed = 1)$data

data_train_AB_2 <- data_train_AB_resampled[data_train_AB_resampled$label == '2',]
data_train_AC_3 <- data_train_AC_resampled[data_train_AC_resampled$label == '3',]

data_train_1 <- data_train_AB_resampled[data_train_AB_resampled$label == '1',]
data_train_combined <- rbind(data_train_1, data_train_AB_2, data_train_AC_3)

cat("(After) Data Category Counts:\n")
print(table(data_train_combined$label))

Normalization

  • Divide the features and label, and apply a different normalization to each copy of the training and testing sets
  • Apply Min-Max normalization to features_train and features_test
  • Apply Z-Score normalization to each column of features_train and features_test (ideally, the test set is scaled using the training set's mean and standard deviation)
  • Apply Max Absolute Value normalization to the training and testing sets
  • Apply L1 norm normalization to the training and testing sets
  • Apply L2 norm normalization to the training and testing sets
Code
# Min-Max scale the base copies in place (the variants below are
# derived from these already-scaled copies)
features_train <- as.data.frame(lapply(features_train, function(x) {(x - min(x)) / (max(x) - min(x))}))
features_test <- as.data.frame(lapply(features_test, function(x) {(x - min(x)) / (max(x) - min(x))}))

# Variant 1: Min-Max
features_train_1 <- as.data.frame(lapply(features_train, function(x) {(x - min(x)) / (max(x) - min(x))}))
features_test_1 <- as.data.frame(lapply(features_test, function(x) {(x - min(x)) / (max(x) - min(x))}))

# Variant 2: Z-Score (each set scaled with its own mean and sd)
features_train_2 <- as.data.frame(lapply(features_train, function(x) {(x - mean(x)) / sd(x)}))
features_test_2 <- as.data.frame(lapply(features_test, function(x) {(x - mean(x)) / sd(x)}))

# Variant 3: Max Absolute Value
features_train_3 <- as.data.frame(lapply(features_train, function(x) {x / max(abs(x))}))
features_test_3 <- as.data.frame(lapply(features_test, function(x) {x / max(abs(x))}))

# Variant 4: L1 norm
features_train_4 <- as.data.frame(lapply(features_train, function(x) {x / sum(abs(x))}))
features_test_4 <- as.data.frame(lapply(features_test, function(x) {x / sum(abs(x))}))

# Variant 5: L2 norm
features_train_5 <- as.data.frame(lapply(features_train, function(x) {x / sqrt(sum(x^2))}))
features_test_5 <- as.data.frame(lapply(features_test, function(x) {x / sqrt(sum(x^2))}))

Model Training

# Build the variant-1 training set: Min-Max normalize the combined,
# resampled training features
data_train_1 <- as.data.frame(lapply(data_train_combined[-ncol(data_train_combined)], function(x) (x - min(x)) / (max(x) - min(x))))
data_train_1$label <- data_train_combined$label
model_1 <- randomForest(x = data_train_1[-ncol(data_train_1)], y = as.factor(data_train_1$label), ntree = 2)
variable_importance_1 <- importance(model_1)
pred_comb_1 <- predict(model_1, features_test_1)
accuracy <- sum(label_test == pred_comb_1) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_1_factor <- as.factor(pred_comb_1)
cm <- confusionMatrix(pred_comb_1_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"], cm$byClass["Class: 2", "Sensitivity"], cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"], cm$byClass["Class: 2", "Pos Pred Value"], cm$byClass["Class: 3", "Pos Pred Value"]))
F1 <- 2 * recall * precision / (recall + precision)
printTable=matrix(c(round(accuracy,2),round(recall,2),round(precision,2),round(F1,2)),ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)
          Score
Accuracy   0.93
Recall     0.87
Precision  0.89
F1         0.88
# Build the variant-2 training set: Z-Score normalize the combined,
# resampled training features
data_train_2 <- as.data.frame(lapply(data_train_combined[-ncol(data_train_combined)], function(x) (x - mean(x)) / sd(x)))
data_train_2$label <- data_train_combined$label
model_2 <- randomForest(x = data_train_2[-ncol(data_train_2)], y = as.factor(data_train_2$label), ntree = 2)
variable_importance_2 <- importance(model_2)
pred_comb_2 <- predict(model_2, features_test_2)
accuracy <- sum(label_test == pred_comb_2) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_2_factor <- as.factor(pred_comb_2)
cm <- confusionMatrix(pred_comb_2_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                 cm$byClass["Class: 2", "Sensitivity"],
                 cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                    cm$byClass["Class: 2", "Pos Pred Value"],
                    cm$byClass["Class: 3", "Pos Pred Value"]))
F1 <- 2 * recall * precision / (recall + precision)
printTable=matrix(c(round(accuracy,2),round(recall,2),round(precision,2),round(F1,2)),ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)
          Score
Accuracy   0.33
Recall     0.48
Precision  0.37
F1         0.42
# Build the variant-3 training set: Max Absolute Value normalize the
# combined, resampled training features
data_train_3 <- as.data.frame(lapply(data_train_combined[-ncol(data_train_combined)], function(x) x / max(abs(x))))
data_train_3$label <- data_train_combined$label
model_3 <- randomForest(x = data_train_3[-ncol(data_train_3)], y = as.factor(data_train_3$label), ntree = 2)
variable_importance_3 <- importance(model_3)
pred_comb_3 <- predict(model_3, features_test_3)
accuracy <- sum(label_test == pred_comb_3) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_3_factor <- as.factor(pred_comb_3)
cm <- confusionMatrix(pred_comb_3_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                 cm$byClass["Class: 2", "Sensitivity"],
                 cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                    cm$byClass["Class: 2", "Pos Pred Value"],
                    cm$byClass["Class: 3", "Pos Pred Value"]))
F1 <- 2 * recall * precision / (recall + precision)
printTable=matrix(c(round(accuracy,2),round(recall,2),round(precision,2),round(F1,2)),ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)
          Score
Accuracy   0.89
Recall     0.80
Precision  0.80
F1         0.80
# Build the variant-4 training set: L1 norm normalize the combined,
# resampled training features
data_train_4 <- as.data.frame(lapply(data_train_combined[-ncol(data_train_combined)], function(x) x / sum(abs(x))))
data_train_4$label <- data_train_combined$label
model_4 <- randomForest(x = data_train_4[-ncol(data_train_4)], y = as.factor(data_train_4$label), ntree = 2)
variable_importance_4 <- importance(model_4)
pred_comb_4 <- predict(model_4, features_test_4)
accuracy <- sum(label_test == pred_comb_4) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_4_factor <- as.factor(pred_comb_4)
cm <- confusionMatrix(pred_comb_4_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                 cm$byClass["Class: 2", "Sensitivity"],
                 cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                    cm$byClass["Class: 2", "Pos Pred Value"],
                    cm$byClass["Class: 3", "Pos Pred Value"]))
F1 <- 2 * recall * precision / (recall + precision)
printTable=matrix(c(round(accuracy,2),round(recall,2),round(precision,2),round(F1,2)),ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)
          Score
Accuracy   0.76
Recall     0.59
Precision  0.60
F1         0.59
# Build the variant-5 training set: L2 norm normalize the combined,
# resampled training features
data_train_5 <- as.data.frame(lapply(data_train_combined[-ncol(data_train_combined)], function(x) x / sqrt(sum(x^2))))
data_train_5$label <- data_train_combined$label
model_5 <- randomForest(x = data_train_5[-ncol(data_train_5)], y = as.factor(data_train_5$label), ntree = 2)
variable_importance_5 <- importance(model_5)
pred_comb_5 <- predict(model_5, features_test_5)
accuracy <- sum(label_test == pred_comb_5) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_5_factor <- as.factor(pred_comb_5)
cm <- confusionMatrix(pred_comb_5_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                 cm$byClass["Class: 2", "Sensitivity"],
                 cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                    cm$byClass["Class: 2", "Pos Pred Value"],
                    cm$byClass["Class: 3", "Pos Pred Value"]))
F1 <- 2 * recall * precision / (recall + precision)
printTable=matrix(c(round(accuracy,2),round(recall,2),round(precision,2),round(F1,2)),ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)
          Score
Accuracy   0.71
Recall     0.51
Precision  0.48
F1         0.49


Conclusion

  • Range was found to be the strongest predictor of threatened/extinct status
  • Habitat loss was found to be the second strongest predictor